Credit Risk Modelling Challenge
7/19/23
This is a supervised binary classification problem.
For the analyses, we start by holding back a testing set with initial_split(). The remaining data are split into training and validation sets:
<Training/Validation/Total>
<600/150/750>
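The sizes above are consistent with a 1,000-row data set split 75/25 and then 80/20. Here is a minimal sketch of such a two-stage split, using a synthetic stand-in for the credit data. The object names credit_split and crd_val follow the code in this post; the toy data frame, the seed, and both prop values are assumptions.

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for the credit data: 1,000 rows with a binary outcome
credit <- data.frame(
  default = factor(sample(c("no", "yes"), 1000, replace = TRUE)),
  amount  = round(rlnorm(1000, meanlog = 8, sdlog = 0.8))
)

# Stage 1: hold back 25% of the rows as the testing set (assumed proportion)
credit_split <- initial_split(credit, prop = 0.75)
credit_train <- training(credit_split)

# Stage 2: split the remaining rows into training and validation sets.
# validation_split() matches the crd_val$splits[[1]] usage in this post;
# newer rsample releases also offer initial_validation_split().
crd_val <- validation_split(credit_train, prop = 0.80)

# Printing the single split shows <Training/Validation/Total>
crd_val$splits[[1]]
```

The analysis() and assessment() accessors return the training and validation rows of that split, respectively.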
library(tidymodels)
library(bestNormalize)  # provides step_orderNorm()

crd_rec <- recipe(default ~ .,
                  data = analysis(crd_val$splits[[1]])) %>%
  # Now add preprocessing steps to the recipe:
  step_impute_knn(all_predictors()) %>%
  step_zv(all_numeric_predictors()) %>%
  step_orderNorm(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  # The next two steps run after step_dummy(), so no nominal predictors
  # remain for them to select (the log below reports no column changes)
  step_other(all_nominal_predictors()) %>%
  step_filter_missing(all_nominal_predictors(), threshold = 0)

crd_rec_trained <-
  crd_rec %>%
  prep(log_changes = TRUE)

step_impute_knn (impute_knn_xog4f): same number of columns
step_zv (zv_x4j7U): same number of columns
step_orderNorm (orderNorm_ASE8C): same number of columns
step_normalize (normalize_UdsR8): same number of columns
step_spatialsign (spatialsign_1scU2): same number of columns
step_dummy (dummy_yWhfj):
new (29): credit_history_delayed, credit_history_fully.repaid, ...
removed (10): credit_history, purpose, personal_status, other_debtors, ...
step_other (other_tAQXZ): same number of columns
step_filter_missing (filter_missing_29F6e): same number of columns
The histograms below show the amount predictor before and after the recipe was prepared:
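A before-and-after comparison of this kind could be drawn roughly as follows. This is only a sketch on synthetic data: the toy data frame is hypothetical, and step_YeoJohnson() is swapped in for step_orderNorm() so that the example needs only the recipes package.

```r
library(recipes)

set.seed(7)

# Hypothetical right-skewed stand-in for the amount predictor
toy <- data.frame(
  default = factor(sample(c("no", "yes"), 300, replace = TRUE)),
  amount  = rlnorm(300, meanlog = 8, sdlog = 0.8)
)

# A reduced recipe: make the distribution more symmetric, then center and scale
toy_rec <- recipe(default ~ amount, data = toy) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the transformations; bake(new_data = NULL) returns
# the processed version of the data the recipe was prepared on
toy_trained <- prep(toy_rec)
processed   <- bake(toy_trained, new_data = NULL)

# Side-by-side histograms of the raw and processed predictor
op <- par(mfrow = c(1, 2))
hist(toy$amount, main = "amount (raw)", xlab = "amount")
hist(processed$amount, main = "amount (processed)", xlab = "amount")
par(op)
```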
UMAP is similar to the popular t-SNE method for nonlinear dimension reduction.
Both the PLS and UMAP methods are worth investigating in conjunction with different models.
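One way to pair these preprocessors with different models is a workflow set, which is consistent with the "pls_mlp" identifier used later in this post. The sketch below only defines the objects and does not tune or fit anything. The toy data, the base recipe, num_comp = 3, and the name crd_wflows are assumptions. Preparing the recipes would additionally need the mixOmics package for step_pls() and the uwot package for step_umap().

```r
library(tidymodels)
library(embed)  # provides step_umap()

set.seed(1)

# Hypothetical numeric stand-in for the credit data
toy <- data.frame(
  default          = factor(sample(c("no", "yes"), 200, replace = TRUE)),
  amount           = rlnorm(200, meanlog = 8, sdlog = 0.8),
  months           = sample(6:60, 200, replace = TRUE),
  age              = sample(19:70, 200, replace = TRUE),
  existing_credits = sample(1:4, 200, replace = TRUE)
)

# Assumed base recipe; the post's own recipe has more steps
base_rec <- recipe(default ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors())

# Supervised dimension reduction variants (num_comp = 3 is an assumption)
pls_rec <- base_rec %>%
  step_pls(all_numeric_predictors(), outcome = "default", num_comp = 3)

umap_rec <- base_rec %>%
  step_umap(all_numeric_predictors(), outcome = "default", num_comp = 3)

# One model specification is enough to illustrate the pairing
mlp_spec <-
  mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Cross every preprocessor with every model; IDs are "<preproc>_<model>"
crd_wflows <- workflow_set(
  preproc = list(basic = base_rec, pls = pls_rec, umap = umap_rec),
  models  = list(mlp = mlp_spec)
)

crd_wflows$wflow_id
```

Adding further entries to the models list extends the grid without touching the recipes.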
library(baguette)  # provides bag_tree()

# single-layer neural network
mlp_spec <-
  mlp(hidden_units = tune(),
      penalty = tune(),
      epochs = tune()) %>%
  set_engine("nnet") %>%
  set_mode("classification")

# bagged trees
bagging_spec <-
  bag_tree(cost_complexity = tune(),
           tree_depth = tune(),
           min_n = tune(),
           class_cost = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Finalize the PLS + neural network workflow with its best roc_auc parameters,
# then fit it on the training data and evaluate it once on the testing set
best_res <-
  crd_res %>%
  extract_workflow("pls_mlp") %>%
  finalize_workflow(
    crd_res %>%
      extract_workflow_set_result("pls_mlp") %>%
      select_best(metric = "roc_auc")
  ) %>%
  last_fit(split = credit_split, metrics = metric_set(roc_auc))

best_wflow_fit <- best_res$.workflow[[1]]
extract_fit_parsnip(best_wflow_fit)

parsnip model object
a 3-1-1 network with 6 weights
inputs: PLS1 PLS2 PLS3
output(s): ..y
options were - entropy fitting decay=0.00000000024
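The printed summary describes a network with 3 inputs (the PLS components), 1 hidden unit, and 1 output. Counting bias terms, that gives (3 + 1) * 1 + (1 + 1) * 1 = 6 weights. The sketch below reproduces the same architecture with nnet directly. The synthetic data and the decay value are assumptions, not the post's actual fit.

```r
library(nnet)

set.seed(42)

# Three synthetic inputs named after the PLS components in the printout
x <- matrix(rnorm(300), ncol = 3,
            dimnames = list(NULL, c("PLS1", "PLS2", "PLS3")))

# Synthetic binary target coded 0/1
y <- as.numeric(x[, 1] - x[, 2] + rnorm(100) > 0)

# size = 1 gives a single hidden unit; entropy = TRUE requests entropy fitting
fit <- nnet(x, y, size = 1, entropy = TRUE, decay = 1e-4, trace = FALSE)

fit$n            # layer sizes: inputs, hidden units, outputs
length(fit$wts)  # number of weights, including bias terms
```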
The plot below shows the variable importance of the features in the final model.
Global explainer for the classification ML tidymodel on the credit data
From the variable importance plot, checking_balance is the least important feature in the final model, whereas job, amount, and existing_credits are the three most important among the selected features.
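The post does not say how the importances were computed. One common model-agnostic approach is permutation importance: shuffle one feature at a time and measure how much the ROC AUC drops. The sketch below illustrates the idea in base R on synthetic data, with a logistic regression standing in for the final model. The data, the feature names, and the number of repetitions are all hypothetical.

```r
set.seed(2023)
n <- 500

# Synthetic data: one strong, one weak, and one irrelevant feature
toy <- data.frame(
  strong = rnorm(n),
  weak   = rnorm(n),
  noise  = rnorm(n)
)
toy$default <- rbinom(n, 1, plogis(2 * toy$strong + 0.3 * toy$weak))

# Stand-in model (the post's final model is a neural network on PLS components)
fit <- glm(default ~ ., data = toy, family = binomial())

# ROC AUC from the rank-sum (Mann-Whitney) formula
auc <- function(truth, score) {
  r     <- rank(score)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

baseline <- auc(toy$default, predict(fit, toy, type = "response"))

# Mean drop in AUC over 20 shuffles of each feature
perm_importance <- sapply(c("strong", "weak", "noise"), function(v) {
  drops <- replicate(20, {
    shuffled <- toy
    shuffled[[v]] <- sample(shuffled[[v]])
    baseline - auc(toy$default, predict(fit, shuffled, type = "response"))
  })
  mean(drops)
})

sort(perm_importance, decreasing = TRUE)
```

A larger drop means the model relies more on that feature; a drop near zero marks a feature the model barely uses.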
https://iamolumide.quarto.pub/credit-risk-model-presentation/